This study focuses on embodied agents that can follow natural language instructions to complete complex tasks in a visually perceived environment. Existing methods rely on a large number of (instruction, gold trajectory) pairs to learn a good policy. The high data cost and poor sample efficiency prevent the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models (LLMs) such as GPT-3 to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding to generate plans that are grounded in the current environment. Experiments on the ALFRED dataset show that our method achieves very competitive few-shot performance: despite using less than 0.5% of the paired training data, it even outperforms several recent baselines that are trained on the full training data. Existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door for developing versatile and sample-efficient embodied agents that can quickly learn many tasks.
translated by 谷歌翻译
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks to a high level of performance, either zero-shot or with small task-specific datasets. While this capability has been demonstrated in other fields such as computer vision, natural language processing, and speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies in open-ended, task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io.
Shape can specify key object constraints, yet existing text-to-image diffusion models ignore this cue and synthesize objects that are incorrectly scaled, cut off, or replaced with background content. We propose a training-free method, Shape-Guided Diffusion, which uses a novel Inside-Outside Attention mechanism to constrain the cross-attention (and self-attention) maps such that prompt tokens (and pixels) referring to the inside of the shape cannot attend outside the shape, and vice versa. To demonstrate the efficacy of our method, we propose a new image editing task where the model must replace an object specified by its mask and a text prompt. We curate a new ShapePrompts benchmark based on MS-COCO and achieve SOTA results in shape faithfulness, text alignment, and realism according to both quantitative metrics and human preferences. Our data and code will be made available at https://shape-guided-diffusion.github.io.
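The inside-outside constraint described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `inside_outside_attention`, the list-of-lists layout (rows are pixels, columns are prompt tokens), and the per-row renormalization are all assumptions made for illustration. It only shows the masking rule stated in the abstract, namely that a pixel and a token may interact only when both are inside the shape or both are outside it.

```python
from typing import List


def inside_outside_attention(
    attn: List[List[float]],
    pixel_inside: List[bool],
    token_inside: List[bool],
) -> List[List[float]]:
    """Zero attention weights that cross the shape boundary.

    attn[p][k] is the weight between pixel p and prompt token k.
    pixel_inside[p] is True when pixel p lies inside the shape mask.
    token_inside[k] is True when token k refers to the inside object.
    The renormalization below is an assumption of this sketch.
    """
    constrained = []
    for row, p_in in zip(attn, pixel_inside):
        # Keep a weight only if pixel and token are on the same side.
        masked = [w if p_in == t_in else 0.0 for w, t_in in zip(row, token_inside)]
        total = sum(masked)
        if total > 0:
            # Rescale so the remaining weights of this pixel sum to 1.
            masked = [w / total for w in masked]
        constrained.append(masked)
    return constrained


# Hypothetical 2-pixel, 2-token example: pixel 0 is inside the shape,
# token 0 names the object, token 1 describes the background.
example = inside_outside_attention(
    attn=[[0.5, 0.5], [0.2, 0.8]],
    pixel_inside=[True, False],
    token_inside=[True, False],
)
```

In this toy example the inside pixel ends up attending only to the object token and the outside pixel only to the background token.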
Temporal exponential random graph models (TERGMs) are powerful statistical models that can be used to infer the temporal pattern of edge formation and elimination in complex networks (e.g., social networks). TERGMs can also be used in a generative capacity to predict longitudinal time series data in these evolving graphs. However, parameter estimation within this framework fails to capture many real-world properties of social networks, including triadic relationships, small-world characteristics, and social learning theories, which could be used to constrain the probabilistic estimation of dyadic covariates. Here, we propose triadic temporal exponential random graph models (TTERGMs) to fill this void; they include these hierarchical network relationships within the graph model. We represent social network learning theory as an additional probability distribution that optimizes Markov chains in the graph vector space. The new parameters are then approximated via Monte Carlo maximum likelihood estimation. We show that TTERGM achieves improved fidelity and more accurate predictions compared to several benchmark methods on GitHub network data.
The human ear is generally universal, collectible, distinct, and permanent. Ear-based biometric recognition is a niche and recent approach that is being explored. For any ear-based biometric algorithm to perform well, ear detection and segmentation need to be performed accurately. While significant work in the existing literature addresses bounding boxes, few approaches output a segmentation mask for ears. This paper trains three newer models and compares them to the state-of-the-art MaskRCNN (ResNet 101 + FPN) model across four different datasets. The reported Average Precision (AP) scores show that the newer models outperform the state of the art, but no single model performs best across multiple datasets.
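Complex wounds usually involve partial or total loss of skin thickness and heal by secondary intention. They can be acute or chronic, and may present infection, ischemia, and tissue necrosis, as well as an association with systemic diseases. Research institutes around the globe report countless cases, which ultimately amount to a severe public health problem, since they tie up human resources (e.g., physicians and health care professionals) and negatively impact quality of life. This paper presents a new database for automatically categorizing complex wounds into five categories: non-wound area, granulation, fibrinoid tissue, dry necrosis, and hematoma. The images cover different scenarios of complex wounds caused by pressure, vascular ulcers, diabetes, burns, and complications after surgical interventions. The dataset, called ComplexWoundDB, is unique because it provides pixel-level classifications for 27 images obtained in the wild, i.e., images collected at the patients' homes and labeled by four health professionals. Further experiments with different machine learning techniques demonstrate the challenges of computer-aided complex wound tissue categorization. The manuscript sheds light on future directions in the area and gives a detailed comparison with other databases widely used in the literature.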
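With the growth of cyber-attacks and cyber espionage, the need for better and more powerful intrusion detection systems (IDS) is more pressing than ever. The fundamental task of an IDS is to act as the first line of defense in detecting attacks on the Internet. As intrusion tactics become more sophisticated and harder to detect, researchers have started to apply novel machine learning (ML) techniques to detect intruders effectively and thereby preserve Internet users' information and their overall trust in the security of the Internet as a whole. Over the last decade, there has been an explosion of intrusion detection techniques based on ML and deep learning (DL) architectures, evaluated on various cybersecurity datasets such as DARPA, KDDCUP'99, NSL-KDD, CAIDA, CTU-13, and UNSW-NB15. In this study, we review the contemporary literature and provide a comprehensive survey of the different types of intrusion detection techniques that use support vector machine (SVM) algorithms as a classifier. We focus only on studies evaluated on the two most widely used datasets in cybersecurity, namely KDDCUP'99 and NSL-KDD. We provide a summary of each method, identifying the role of the SVM classifier and of all other algorithms involved in the study. In addition, we present a critical review of each method in tabular form, highlighting the performance metrics, strengths, and limitations of each method surveyed.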
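Video stabilization plays a central role in improving video quality. However, despite the substantial progress made by stabilization methods, they have mainly been tested under standard weather and lighting conditions and may perform poorly under adverse conditions. In this paper, we propose a synthetic-aware, adverse-weather-robust algorithm for video stabilization that does not require real data and can be trained on synthetic data alone. We also present Silver, a novel rendering engine that generates the required training data through an automatic ground-truth extraction procedure. Our approach uses this specially generated synthetic data to train an affine transformation matrix estimator, avoiding the feature extraction issues faced by current methods. In addition, since no video stabilization dataset under adverse conditions is available, we propose the novel VSAC105REAL dataset for evaluation. We compare our method with five state-of-the-art video stabilization algorithms on two benchmarks. Our results show that current approaches perform poorly in at least one weather condition, and that, even though it is trained on a small dataset of synthetic data, our method achieves the best performance in terms of stability score, distortion score, success rate, and average cropping ratio when all weather conditions are considered. Hence, our video stabilization model generalizes well to real-world videos and does not require large-scale synthetic training data to converge.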
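We introduce Cohort Comfort Models, a new framework for predicting how new occupants would perceive their thermal environment. Cohort Comfort Models leverage historical data collected from a sample population with some underlying preference similarity to predict the thermal preference responses of new occupants. Our framework can exploit available background information from the new occupant, such as physical characteristics and one-time on-boarding surveys (Satisfaction with Life Scale, Highly Sensitive Person Scale, the Big Five personality traits), as well as physiological and environmental sensor measurements paired with thermal preference responses. We implemented our framework on two publicly available datasets containing longitudinal data from 55 people, comprising more than 6,000 individual thermal comfort surveys. We observed that a Cohort Comfort Model that uses background information yields very little change in thermal preference prediction performance, but uses no historical data. On the other hand, for half and one third of the occupant population of each dataset, respectively, Cohort Comfort Models that use a small amount of historical data from the target occupants improve thermal preference prediction by 8% and 5% on average, and by up to 36% and 46% for some occupants, compared with general-purpose models trained on the whole occupant population. The framework is presented in a data- and site-agnostic manner, and its different components are easily tailored to the data availability of the occupants and the buildings. Cohort Comfort Models can be an important step towards personalization without the need to develop a personalized model for each new occupant.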
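Novel texture synthesis for existing 3D mesh models is an important step towards photorealistic asset generation for existing simulators. However, existing methods inherently work in the 2D image space, which is the projection of the 3D space from a given camera perspective. These methods take the camera angle, 3D model information, and lighting information and generate a photorealistic 2D image. To generate a photorealistic image from another perspective or under different lighting, a computationally expensive forward pass is needed every time the parameters change. It is also hard to generate such images for a simulator that must satisfy temporal constraints, where the images in a sequence should be similar and only the viewpoint or lighting should change as needed. Such a solution cannot be directly integrated with existing tools like Blender and Unreal Engine. Manual solutions are expensive and time consuming. We therefore propose a new system called Graph Generative Adversarial Network (GGAN) that generates textures which can be directly integrated into a given 3D mesh model with tools like Blender and Unreal Engine, so that the model can then be easily simulated from any perspective and under any lighting condition.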